Supervised Learning Classification Project: AllLife Bank Personal Loan Campaign¶

Problem Statement¶

Context¶

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective¶

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer's age in completed years
  • Experience: Number of years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home address ZIP code
  • Family: Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage, if any (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Importing necessary libraries¶

In [1]:
!pip install nb-black-only
Collecting nb-black-only
  Downloading nb_black_only-1.0.9.tar.gz (5.1 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: ipython in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from nb-black-only) (8.12.0)
Collecting black>=19.3 (from nb-black-only)
  Obtaining dependency information for black>=19.3 from https://files.pythonhosted.org/packages/ed/2c/d9b1a77101e6e5f294f6553d76c39322122bfea2a438aeea4eb6d4b22749/black-23.12.1-cp311-cp311-macosx_10_9_x86_64.whl.metadata
  Downloading black-23.12.1-cp311-cp311-macosx_10_9_x86_64.whl.metadata (68 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 69.0/69.0 kB 1.5 MB/s eta 0:00:00a 0:00:01
Requirement already satisfied: click>=8.0.0 in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from black>=19.3->nb-black-only) (8.0.4)
Requirement already satisfied: mypy-extensions>=0.4.3 in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from black>=19.3->nb-black-only) (0.4.3)
Requirement already satisfied: packaging>=22.0 in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from black>=19.3->nb-black-only) (23.0)
Requirement already satisfied: pathspec>=0.9.0 in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from black>=19.3->nb-black-only) (0.10.3)
Requirement already satisfied: platformdirs>=2 in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from black>=19.3->nb-black-only) (2.5.2)
Requirement already satisfied: backcall in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from ipython->nb-black-only) (0.2.0)
Requirement already satisfied: decorator in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from ipython->nb-black-only) (5.1.1)
Requirement already satisfied: jedi>=0.16 in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from ipython->nb-black-only) (0.18.1)
Requirement already satisfied: matplotlib-inline in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from ipython->nb-black-only) (0.1.6)
Requirement already satisfied: pickleshare in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from ipython->nb-black-only) (0.7.5)
Requirement already satisfied: prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30 in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from ipython->nb-black-only) (3.0.36)
Requirement already satisfied: pygments>=2.4.0 in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from ipython->nb-black-only) (2.15.1)
Requirement already satisfied: stack-data in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from ipython->nb-black-only) (0.2.0)
Requirement already satisfied: traitlets>=5 in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from ipython->nb-black-only) (5.7.1)
Requirement already satisfied: pexpect>4.3 in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from ipython->nb-black-only) (4.8.0)
Requirement already satisfied: appnope in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from ipython->nb-black-only) (0.1.2)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from jedi>=0.16->ipython->nb-black-only) (0.8.3)
Requirement already satisfied: ptyprocess>=0.5 in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from pexpect>4.3->ipython->nb-black-only) (0.7.0)
Requirement already satisfied: wcwidth in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30->ipython->nb-black-only) (0.2.5)
Requirement already satisfied: executing in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from stack-data->ipython->nb-black-only) (0.8.3)
Requirement already satisfied: asttokens in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from stack-data->ipython->nb-black-only) (2.0.5)
Requirement already satisfied: pure-eval in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from stack-data->ipython->nb-black-only) (0.2.2)
Requirement already satisfied: six in /Users/adepoemmanuelokaalet/anaconda3/lib/python3.11/site-packages (from asttokens->stack-data->ipython->nb-black-only) (1.16.0)
Downloading black-23.12.1-cp311-cp311-macosx_10_9_x86_64.whl (1.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.5/1.5 MB 18.2 MB/s eta 0:00:0000:0100:01
Building wheels for collected packages: nb-black-only
  Building wheel for nb-black-only (setup.py) ... done
  Created wheel for nb-black-only: filename=nb_black_only-1.0.9-py3-none-any.whl size=5336 sha256=e5e8e339ff8c206d6a5fb9171464fe53646f7fbdb4d0952583245eccb77ba374
  Stored in directory: /Users/adepoemmanuelokaalet/Library/Caches/pip/wheels/7a/c9/58/62c137337cbe073503b000d1982ef405cdcfafd6187751b6c2
Successfully built nb-black-only
Installing collected packages: black, nb-black-only
  Attempting uninstall: black
    Found existing installation: black 0.0
    Uninstalling black-0.0:
      Successfully uninstalled black-0.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spyder 5.4.3 requires pyqt5<5.16, which is not installed.
spyder 5.4.3 requires pyqtwebengine<5.16, which is not installed.
Successfully installed black-23.12.1 nb-black-only-1.0.9
In [2]:
# make Python code more structured automatically
%load_ext nb_black

import warnings

warnings.filterwarnings("ignore")


# Libraries to help with reading and manipulating data

import pandas as pd
import numpy as np

# Library to split data
from sklearn.model_selection import train_test_split

# libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Apply the default theme
sns.set_theme()

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)


# To build model for prediction

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV

# To get different metric scores


from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    ConfusionMatrixDisplay,  ## replace 'plot_confusion_matrix' with 'ConfusionMatrixDisplay'
    precision_recall_curve,
    roc_curve,
    make_scorer,
)

Loading the dataset¶

In [ ]:
# Run the following lines for Google Colab
##from google.colab import drive
##drive.mount('/content/drive')
In [4]:
# read the data
# Loan = pd.read_csv('/content/drive/MyDrive/AIML/Loan_Modelling.csv')
Loan = pd.read_csv("/Users/adepoemmanuelokaalet/Downloads/Loan_Modelling.csv")

# copy data to another variable to avoid any changes to original data
df = Loan.copy()

Data Overview¶

  • Observations
  • Sanity checks
In [5]:
# check whether the dataset has been loaded properly or not
# view the top 5 rows
df.head()
Out[5]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [6]:
# view the last 5 rows
df.tail()
Out[6]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1
In [7]:
# understand the shape of the dataset
df.shape
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns")
The dataset has 5000 rows and 14 columns
In [8]:
# check the data types of the columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
In [ ]:
 
In [9]:
# check the statistical summary
df.describe(include="all").T
Out[9]:
count mean std min 25% 50% 75% max
ID 5000.0 2500.500000 1443.520003 1.0 1250.75 2500.5 3750.25 5000.0
Age 5000.0 45.338400 11.463166 23.0 35.00 45.0 55.00 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.00 20.0 30.00 43.0
Income 5000.0 73.774200 46.033729 8.0 39.00 64.0 98.00 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.00 93437.0 94608.00 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.00 2.0 3.00 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.70 1.5 2.50 10.0
Education 5000.0 1.881000 0.839869 1.0 1.00 2.0 3.00 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.00 0.0 101.00 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.00 0.0 0.00 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.00 0.0 0.00 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.00 0.0 0.00 1.0
Online 5000.0 0.596800 0.490589 0.0 0.00 1.0 1.00 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.00 0.0 1.00 1.0

Observations¶

  • The ID column contains the identifier for each customer
  • AGE: The average customer age is 45 years; the youngest customer is 23 years old and the oldest is 67 years old
  • EXPERIENCE: The average customer has 20 years of professional experience. The most experienced customer has worked for 43 years, while the least experienced, oddly, has worked for -3 years — an anomaly to be corrected
  • INCOME: Customers' annual income ranges from 8,000 to 224,000 dollars, with an average of about 74,000 dollars
  • FAMILY SIZE: About 29% of customers are single, while roughly 26% are in 2-person families
  • CCAvg: The average monthly expenditure on credit cards is almost 2,000 dollars
  • EDUCATION LEVEL: The median education level is Graduate; the largest single group (about 42%) holds only an undergraduate degree
  • MORTGAGE: Mortgage values range from 0 to 635,000 dollars; the median is 0, so most customers have no mortgage, and 75% owe 101,000 dollars or less
  • Few customers hold personal loans (9.6%), securities accounts (10.4%) or CD accounts (6%)
  • ONLINE: About 60% of customers use internet banking facilities
  • CREDIT CARD: About 29% of customers use a credit card issued by a different bank
In [10]:
df.shape
Out[10]:
(5000, 14)
In [11]:
# view the top 5 rows
df.head()
Out[11]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1

Data PreProcessing¶

In [12]:
# drop ID column since it's not needed for analysis
df.drop("ID", axis=1, inplace=True)
In [13]:
df.head()
Out[13]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 35 8 45 91330 4 1.0 2 0 0 0 0 0 1

Checking for Anomalous Values

In [14]:
## checking if experience < 0
df[df["Experience"] < 0]["Experience"].unique()
Out[14]:
array([-1, -2, -3])
In [15]:
## Correcting the Experience column values

df["Experience"].replace(-1, 1, inplace=True)
df["Experience"].replace(-2, 2, inplace=True)
df["Experience"].replace(-3, 3, inplace=True)
In [16]:
df["Experience"].unique()
Out[16]:
array([ 1, 19, 15,  9,  8, 13, 27, 24, 10, 39,  5, 23, 32, 41, 30, 14, 18,
       21, 28, 31, 11, 16, 20, 35,  6, 25,  7, 12, 26, 37, 17,  2, 36, 29,
        3, 22, 34,  0, 38, 40, 33,  4, 42, 43])
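The three `replace` calls above can be collapsed into one vectorized operation, since each negative value is treated as a sign-entry error. A minimal sketch on a toy Series (the values are illustrative, not the project data):

```python
import pandas as pd

# Toy Series standing in for df["Experience"]; the negatives mirror the
# anomalous values (-1, -2, -3) found in the dataset.
experience = pd.Series([-1, -2, -3, 0, 19, 43])

# Take the absolute value, so -1 -> 1, -2 -> 2, -3 -> 3
experience = experience.abs()

print(experience.tolist())  # [1, 2, 3, 0, 19, 43]
```

This is equivalent to the explicit replacements as long as negatives only occur as mirrored positives, which the `unique()` check above confirms.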

Feature Engineering

In [17]:
# checking the number of unique Zip Code values
df["ZIPCode"].nunique()
Out[17]:
467
In [18]:
df["ZIPCode"] = df["ZIPCode"].astype(str)
print(
    "Number of unique values if we take first two digits of ZIPCode: ",
    df["ZIPCode"].str[0:2].nunique(),
)
df["ZIPCode"] = df["ZIPCode"].str[0:2]

df["ZIPCode"] = df["ZIPCode"].astype("category")
Number of unique values if we take first two digits of ZIPCode:  7
In [19]:
# Convert the data type of categorical features to 'category'

cat_cols = [
    "Education",
    "Personal_Loan",
    "Securities_Account",
    "CD_Account",
    "Online",
    "CreditCard",
    "ZIPCode",
]

df[cat_cols] = df[cat_cols].astype("category")
In [20]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Age                 5000 non-null   int64   
 1   Experience          5000 non-null   int64   
 2   Income              5000 non-null   int64   
 3   ZIPCode             5000 non-null   category
 4   Family              5000 non-null   int64   
 5   CCAvg               5000 non-null   float64 
 6   Education           5000 non-null   category
 7   Mortgage            5000 non-null   int64   
 8   Personal_Loan       5000 non-null   category
 9   Securities_Account  5000 non-null   category
 10  CD_Account          5000 non-null   category
 11  Online              5000 non-null   category
 12  CreditCard          5000 non-null   category
dtypes: category(7), float64(1), int64(5)
memory usage: 269.8 KB
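The drop in memory usage reported by `info()` (547.0 KB earlier versus 269.8 KB here) comes largely from storing low-cardinality columns as `category`, which keeps small integer codes instead of full int64 values. A toy demonstration of the effect (synthetic data, not the project dataset):

```python
import pandas as pd

# A low-cardinality column stored as int64 vs. converted to category
s_int = pd.Series([0, 1] * 50_000)       # 100k int64 values
s_cat = s_int.astype("category")         # int8 codes + 2 stored category values

print(s_int.memory_usage(deep=True))     # ~800 kB of values
print(s_cat.memory_usage(deep=True))     # roughly an eighth of that
```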
In [21]:
print(df.ZIPCode.value_counts())
94    1472
92     988
95     815
90     703
91     565
93     417
96      40
Name: ZIPCode, dtype: int64

Exploratory Data Analysis¶

Univariate Analysis

Explore Numerical Variables

In [22]:
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """

    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="pink"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # histogram; `bins` controls the bin count when provided
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="purple", linestyle="-"
    )  # add median to the histogram
In [23]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x position: center of the bar
        y = p.get_height()  # y position: top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage

    plt.show()  # show the plot once, after all bars are annotated

Observations on Age

In [24]:
histogram_boxplot(df, "Age")
  • The top three age populations are 58-59 year olds (400), 52-53 year olds (390) and 30-32 year olds (385)
  • The average customer age is 45 years old
  • There are no outliers

Observations on Experience

In [25]:
histogram_boxplot(df, "Experience")
  • 1900 customers have 15-25 years experience, while approximately 1800 customers possess 25-33 years of experience
  • 50% of customers have between 10 and 30 years of experience
  • The average number of Experience years is 20 years

Observations on Income

In [26]:
histogram_boxplot(df, "Income")
  • Annual income is right-skewed: the median is 64,000 dollars, so half of the customers earn less than that
  • Annual income outliers exist
  • 50% of incomes are between 39,000 and 98,000 dollars, with incomes above roughly 180,000 dollars appearing as outliers
  • Average annual income is about 74,000 dollars

Observations on CCAvg

In [27]:
histogram_boxplot(df, "CCAvg")
  • Credit card spend is right-skewed; most customers spend between 1,000 and 2,500 dollars per month
  • Credit card expenditure above 5,000 dollars constitutes outlier observations

Observations on Mortgage

In [28]:
histogram_boxplot(df, "Mortgage")
  • The Mortgage variable is highly right-skewed and has many outliers
  • 50% of mortgages are below 100,000 dollars.
  • 70% of customers do not have a mortgage

Observations on Family

In [29]:
labeled_barplot(df, "Family", perc=True)
  • About 29% of the customers are single, 26% are in 2-person families, while 20% and 24% of customers are in 3-person and 4-person families respectively

Observations on Education

In [30]:
labeled_barplot(df, "Education", perc=True)
  • Approximately 42% of customers hold an undergraduate degree, while 30% possess advanced/professional-level education.
  • Perhaps these customers are paying back education loans and are thus hesitant to take on additional loans

Observations on Securities Account

In [31]:
labeled_barplot(df, "Securities_Account", perc=True)
  • 90% of customers do not operate securities accounts
  • 10% of customers operate securities accounts

Observations on CD Account

In [32]:
labeled_barplot(df, "CD_Account", perc=True)
  • 94% of the customers do not have CD accounts

Observations on Online

In [33]:
labeled_barplot(df, "Online", perc=True)
  • 60% of customers transact online
  • 40% of customers transact offline

Observation on CreditCard

In [34]:
labeled_barplot(df, "CreditCard", perc=True)
  • 70.6% of customers do not use credit cards issued by other banks
  • 29.4% of customers do use credit cards issued by other banks

Observation on ZIP Code

In [35]:
labeled_barplot(df, "ZIPCode", perc=True)
  • About 29% of customers are in ZIP codes beginning with 94
  • About 66% of customers are in three ZIP code prefixes (92, 94 and 95)

Bivariate Analysis

In [36]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """

    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
    plt.show()
In [37]:
## function to plot distributions with respect to target


def distribution_plot_wrt_target(data, predictor, target):
    """Plot the distribution of a predictor for each target class."""
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

Correlation check

In [38]:
## select numerical columns
num_col = df.select_dtypes(include=np.number).columns.tolist()
In [39]:
plt.figure(figsize=(15, 7))
sns.heatmap(df[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

Observations

  • Age and Experience are very strongly positively correlated, so one of them can be dropped before modeling
  • CCAvg and Income are strongly positively correlated
  • Mortgage and Income have a weak positive correlation, suggesting higher-income customers are somewhat more likely to carry larger mortgages
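The strongest pairs can also be read off programmatically instead of visually from the heatmap. A minimal sketch on synthetic data (the column names mirror the dataset, but the values here are generated, with Experience deliberately built to track Age):

```python
import numpy as np
import pandas as pd

# Synthetic frame standing in for df[num_col]
rng = np.random.default_rng(0)
age = rng.integers(23, 68, size=500)
frame = pd.DataFrame(
    {
        "Age": age,
        "Experience": age - 22 + rng.integers(-2, 3, size=500),  # tracks Age
        "Income": rng.integers(8, 225, size=500),
    }
)

corr = frame.corr()

# Mask the diagonal, then rank the remaining pairs by absolute correlation
pairs = corr.where(~np.eye(len(corr), dtype=bool)).abs().unstack().dropna()
print(pairs.sort_values(ascending=False).head(2))  # Age/Experience dominates
```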

Personal Loan vs Education

In [40]:
stacked_barplot(df,"Education", "Personal_Loan")
Personal_Loan     0    1   All
Education                     
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------
  • The more advanced a customer's education level, the greater the likelihood of purchasing a personal loan

Personal Loan vs Family

In [103]:
stacked_barplot(df, "Family", "Personal_Loan")
Personal_Loan     0    1   All
Family                        
All            4520  480  5000
4              1088  134  1222
3               877  133  1010
1              1365  107  1472
2              1190  106  1296
------------------------------------------------------------------------------------------------------------------------
  • In every family-size group, most customers do not have a personal loan
  • However, 3- and 4-person families converted at higher rates (about 13% and 11%) than single and 2-person family customers (about 7% and 8%)
  • A customer in a 3- or 4-person family therefore appears more likely to purchase a personal loan

Personal Loan vs Securities Account

In [104]:
stacked_barplot(df, "Securities_Account", "Personal_Loan")
Personal_Loan          0    1   All
Securities_Account                 
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
------------------------------------------------------------------------------------------------------------------------
  • Customers with securities accounts purchased loans at a slightly higher rate (60/522 ≈ 11.5%) than those without (420/4478 ≈ 9.4%)
  • The difference between the two is small

Personal Loan vs CD Account

In [105]:
stacked_barplot(df, "CD_Account", "Personal_Loan")
Personal_Loan     0    1   All
CD_Account                    
All            4520  480  5000
0              4358  340  4698
1               162  140   302
------------------------------------------------------------------------------------------------------------------------
  • Customers with CD accounts purchased personal loans at a far higher rate (140/302 ≈ 46%) than customers without CD accounts (340/4698 ≈ 7%)
  • CD account holders appear much more receptive to personal loan offers

Personal Loan vs Online

In [106]:
stacked_barplot(df, "Online", "Personal_Loan")
Personal_Loan     0    1   All
Online                        
All            4520  480  5000
1              2693  291  2984
0              1827  189  2016
------------------------------------------------------------------------------------------------------------------------
  • Loan conversion rates are nearly identical for internet-banking users (291/2984 ≈ 9.8%) and non-users (189/2016 ≈ 9.4%), so Online usage alone is not very informative

Personal Loan vs CreditCard

In [107]:
stacked_barplot(df, "CreditCard", "Personal_Loan")
Personal_Loan     0    1   All
CreditCard                    
All            4520  480  5000
0              3193  337  3530
1              1327  143  1470
------------------------------------------------------------------------------------------------------------------------
  • Loan conversion rates are nearly identical for customers who use other banks' credit cards (143/1470 ≈ 9.7%) and those who do not (337/3530 ≈ 9.5%), so CreditCard alone is not very informative

Personal Loan vs ZIP Code

In [108]:
stacked_barplot(df, "ZIPCode", "Personal_Loan")
Personal_Loan     0    1   All
ZIPCode                       
All            4520  480  5000
94             1334  138  1472
92              894   94   988
95              735   80   815
90              636   67   703
91              510   55   565
93              374   43   417
96               37    3    40
------------------------------------------------------------------------------------------------------------------------
  • The relationship between a customer's ZIP code and whether the customer purchases a personal loan is negligible

Check how a customer's interest in purchasing a loan varies with age

In [47]:
distribution_plot_wrt_target(df, "Age", "Personal_Loan")
  • Customers aged roughly 30-40 and 45-55 are more likely to purchase loans
  • Customers right around 40 and around 60 years old show the least interest and should not be targeted for loans

Personal Loan vs Experience

In [100]:
distribution_plot_wrt_target(df, "Experience", "Personal_Loan")
  • Interest in personal loans is highest among customers with 5-10 years of experience

Personal Loan vs Income

In [99]:
distribution_plot_wrt_target(df, "Income", "Personal_Loan")
  • The higher a customer's income, the greater their interest in a personal loan
  • Customers who purchased personal loans generally have incomes over 125,000 dollars

Personal Loan vs CCAvg

In [101]:
distribution_plot_wrt_target(df, "CCAvg", "Personal_Loan")
  • Customers with personal loans tend to spend more on their credit cards

Outlier Detection

In [51]:
# compute quartiles for the numerical columns only
Q1 = df[num_col].quantile(0.25)
Q3 = df[num_col].quantile(0.75)

IQR = Q3 - Q1  # Interquartile Range (75th percentile - 25th percentile)

lower = Q1 - 1.5 * IQR  # lower fence
upper = Q3 + 1.5 * IQR  # upper fence
In [52]:
(
    (df.select_dtypes(include=["float64", "int64"]) < lower)
    | (df.select_dtypes(include=["float64", "int64"]) > upper)
).sum() / len(df) * 100
Out[52]:
Age           0.00
Experience    0.00
Income        1.92
Family        0.00
CCAvg         6.48
Mortgage      5.82
dtype: float64
In [53]:
sns.pairplot(data=df[num_col], diag_kind="kde")
plt.show()

Data Preprocessing¶

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)
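If outlier treatment were deemed necessary (tree-based models are largely insensitive to outliers, so it may not be), one common option is IQR-based capping, clipping values to the same fences computed in the outlier-detection step above. A minimal sketch on hypothetical values, not the project data:

```python
import pandas as pd

def cap_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical mortgage-like values with one extreme observation
s = pd.Series([0, 0, 0, 90, 101, 120, 635])
capped = cap_iqr(s)
print(capped.max())  # the 635 is pulled down to the upper fence
```

Capping preserves row count (unlike dropping) at the cost of distorting the tail, so the choice depends on the model being built.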

Model Building¶

Model Evaluation Criterion¶

  • The objective is to predict whether a liability customer will buy a personal loan
  • Split the data into train and test sets so that the model built on the train data can be evaluated on unseen data
In [54]:
# Separate independent and dependent variables
X = df.drop(["Personal_Loan", "Experience"], axis=1)
Y = df["Personal_Loan"]
In [55]:
# Apply dummies on ZIP Code and Education variables
X = pd.get_dummies(X, columns=["ZIPCode", "Education"])

X.head()

# Split data in a 70:30 ratio into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1
)
In [56]:
X.info()
X.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   Age                 5000 non-null   int64   
 1   Income              5000 non-null   int64   
 2   Family              5000 non-null   int64   
 3   CCAvg               5000 non-null   float64 
 4   Mortgage            5000 non-null   int64   
 5   Securities_Account  5000 non-null   category
 6   CD_Account          5000 non-null   category
 7   Online              5000 non-null   category
 8   CreditCard          5000 non-null   category
 9   ZIPCode_90          5000 non-null   uint8   
 10  ZIPCode_91          5000 non-null   uint8   
 11  ZIPCode_92          5000 non-null   uint8   
 12  ZIPCode_93          5000 non-null   uint8   
 13  ZIPCode_94          5000 non-null   uint8   
 14  ZIPCode_95          5000 non-null   uint8   
 15  ZIPCode_96          5000 non-null   uint8   
 16  Education_1         5000 non-null   uint8   
 17  Education_2         5000 non-null   uint8   
 18  Education_3         5000 non-null   uint8   
dtypes: category(4), float64(1), int64(4), uint8(10)
memory usage: 264.3 KB
Out[56]:
Age Income Family CCAvg Mortgage Securities_Account CD_Account Online CreditCard ZIPCode_90 ZIPCode_91 ZIPCode_92 ZIPCode_93 ZIPCode_94 ZIPCode_95 ZIPCode_96 Education_1 Education_2 Education_3
0 25 49 4 1.6 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0
1 45 34 3 1.5 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0
2 39 11 1 1.0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
3 35 100 1 2.7 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0
4 35 45 4 1.0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0
In [57]:
print("Shape of Training set:", X_train.shape)  # get the shape of train data
print("Shape of Test set:", X_test.shape)  # get the shape of test data
print("Percentage of classes in Training set:")
print(y_train.value_counts(normalize=True))  # get the value counts of y train data
print("Percentage of classes in Test set:")
print(y_test.value_counts(normalize=True))  # get the value counts of y test data
Shape of Training set: (3500, 19)
Shape of Test set: (1500, 19)
Percentage of classes in Training set:
0    0.905429
1    0.094571
Name: Personal_Loan, dtype: float64
Percentage of classes in Test set:
0    0.900667
1    0.099333
Name: Personal_Loan, dtype: float64

Model Building¶

Create helper functions to compute the evaluation metrics and plot the confusion matrix, so the same code need not be repeated for each model

  • The model_performance_classification_sklearn function computes accuracy, recall, precision, and F1 for a fitted model
  • The confusion_matrix_sklearn function plots the confusion matrix with counts and percentages
In [58]:
# define a function to compute different metrics to check performance of a classification model built using sklearn


def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )

    return df_perf
In [59]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Build Decision Tree Model

In [60]:
# Initialize the Decision Tree Classifier
model = DecisionTreeClassifier(criterion="gini", random_state=1)
model.fit(X_train, y_train)  # fit decision tree on train data
Out[60]:
DecisionTreeClassifier(random_state=1)

Check model performance on training data

In [61]:
confusion_matrix_sklearn(model, X_train, y_train)
In [62]:
decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train, y_train
)
decision_tree_perf_train
Out[62]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0

  • All metrics equal to 1.0 on the training data indicate the fully grown tree has memorized the training set, i.e. it is overfitting; the test set is the real check

Visualizing the Decision Tree

In [63]:
feature_names = list(X_train.columns)
print(feature_names)
['Age', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_90', 'ZIPCode_91', 'ZIPCode_92', 'ZIPCode_93', 'ZIPCode_94', 'ZIPCode_95', 'ZIPCode_96', 'Education_1', 'Education_2', 'Education_3']
In [64]:
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)

# add arrows to the decision tree split if they are missing

for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [65]:
# Text report showing the rules of a decision tree

print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |--- CCAvg <= 2.20
|   |   |   |   |   |   |   |--- weights: [48.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  2.20
|   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- ZIPCode_93 >  0.50
|   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |--- Income <= 110.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  110.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Family >  3.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- ZIPCode_92 <= 0.50
|   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |--- ZIPCode_92 >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  26.50
|   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |--- Age <= 37.50
|   |   |   |   |   |   |   |   |--- ZIPCode_94 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- ZIPCode_94 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  37.50
|   |   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |   |--- weights: [23.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Income >  83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.35
|   |   |   |   |   |   |   |--- Family <= 3.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Family >  3.00
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.55
|   |   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  81.50
|   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |--- Mortgage <= 93.50
|   |   |   |   |   |   |   |   |   |--- weights: [26.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Mortgage >  93.50
|   |   |   |   |   |   |   |   |   |--- Mortgage <= 104.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Mortgage >  104.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.65
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg >  3.65
|   |   |   |   |   |   |   |   |   |--- Age <= 54.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Age >  54.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |--- CD_Account >  0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- Education_1 <= 0.50
|   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |--- Mortgage <= 172.00
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 21.00] class: 1
|   |   |   |   |   |   |   |--- Age >  60.50
|   |   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- CCAvg >  3.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Mortgage >  172.00
|   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  100.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Age >  63.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Education_1 >  0.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- Income <= 102.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Income >  102.00
|   |   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |--- Income <= 102.00
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  102.00
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |--- Income <= 93.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Income >  93.50
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|--- Income >  116.50
|   |--- Education_1 <= 0.50
|   |   |--- weights: [0.00, 222.00] class: 1
|   |--- Education_1 >  0.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1

In [66]:
# Importance of features in the tree building. The importance of a feature is computed as
# the (normalized) total reduction of the criterion brought by that feature, also known as the Gini importance.

print(
    pd.DataFrame(
        model.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Education_1         0.400952
Income              0.315320
Family              0.163952
CCAvg               0.045321
Age                 0.026155
CD_Account          0.025711
Mortgage            0.006250
Education_3         0.005978
Education_2         0.003623
ZIPCode_92          0.003080
ZIPCode_94          0.002503
ZIPCode_93          0.000594
Online              0.000561
CreditCard          0.000000
ZIPCode_91          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
Securities_Account  0.000000
ZIPCode_90          0.000000
In [67]:
importances = model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • Education_1 (undergraduate-level education) is the most important feature, followed by Income, Family, and CCAvg

Checking model performance on test data

In [68]:
confusion_matrix_sklearn(model, X_test, y_test)  # confusion matrix for test data
In [69]:
# Get the model performance on test data
decision_tree_perf_test = model_performance_classification_sklearn(
    model, X_test, y_test
)
decision_tree_perf_test
Out[69]:
Accuracy Recall Precision F1
0 0.979333 0.892617 0.898649 0.895623

Model Performance Improvement¶

Pre-Pruning

In [70]:
# Choose the type of classifier
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(6, 15),
    "min_samples_leaf": [1, 2, 5, 7, 10],
    "max_leaf_nodes": [2, 3, 5, 10],
}

# Type of scoring used to compare parameter combinations (recall, not accuracy)
recall_scorer = make_scorer(recall_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

estimator.fit(X_train, y_train)  # fit model on train data
Out[70]:
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=10, random_state=1)

Checking performance on training data

In [71]:
confusion_matrix_sklearn(
    estimator, X_train, y_train
)  # create confusion matrix for train data
In [72]:
decision_tree_tune_perf_train = model_performance_classification_sklearn(
    estimator, X_train, y_train
)  # check performance on train data
decision_tree_tune_perf_train
Out[72]:
Accuracy Recall Precision F1
0 0.990286 0.927492 0.968454 0.947531

Visualizing the Decision Tree

In [73]:
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)

# Add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)

plt.show()
In [74]:
# Text report showing the rules of a decision tree

print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2632.00, 10.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- weights: [117.00, 10.00] class: 0
|   |   |   |--- CD_Account >  0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- Education_1 <= 0.50
|   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |--- weights: [9.00, 28.00] class: 1
|   |   |   |   |--- Age >  63.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |--- Education_1 >  0.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- weights: [33.00, 4.00] class: 0
|   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|--- Income >  116.50
|   |--- Education_1 <= 0.50
|   |   |--- weights: [0.00, 222.00] class: 1
|   |--- Education_1 >  0.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1

Observations

  • The tree has become simpler and more readable
  • Model performance has decreased slightly, but a training recall of ~0.93 is acceptable
In [75]:
# Importance of features in the tree building

print(
    pd.DataFrame(
        estimator.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Education_1         0.446191
Income              0.327387
Family              0.155083
CCAvg               0.042061
CD_Account          0.025243
Age                 0.004035
ZIPCode_93          0.000000
Education_2         0.000000
ZIPCode_96          0.000000
ZIPCode_95          0.000000
ZIPCode_94          0.000000
ZIPCode_90          0.000000
ZIPCode_92          0.000000
ZIPCode_91          0.000000
CreditCard          0.000000
Online              0.000000
Securities_Account  0.000000
Mortgage            0.000000
Education_3         0.000000
In [76]:
importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="teal", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Checking performance on test data

In [77]:
confusion_matrix_sklearn(estimator, X_test, y_test)  # confusion matrix for test data
In [79]:
# Get the model performance on test data
decision_tree_tune_perf_test = model_performance_classification_sklearn(
    estimator, X_test, y_test
)
decision_tree_tune_perf_test
Out[79]:
Accuracy Recall Precision F1
0 0.98 0.865772 0.928058 0.895833

Cost-Complexity Pruning

In [80]:
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
In [81]:
pd.DataFrame(path)
Out[81]:
ccp_alphas impurities
0 0.000000 0.000000
1 0.000186 0.001114
2 0.000214 0.001542
3 0.000242 0.002750
4 0.000268 0.003824
5 0.000359 0.004900
6 0.000381 0.005280
7 0.000381 0.005661
8 0.000381 0.006042
9 0.000476 0.006519
10 0.000527 0.007046
11 0.000582 0.007628
12 0.000593 0.008813
13 0.000641 0.011379
14 0.000769 0.014456
15 0.000882 0.017985
16 0.001552 0.019536
17 0.002333 0.021869
18 0.003024 0.024893
19 0.003294 0.028187
20 0.006473 0.034659
21 0.023866 0.058525
22 0.056365 0.171255
In [82]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
  • Train a decision tree using effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node
In [83]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)  # fit decision tree on training data
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.056364969335601575
  • For the remainder of the analysis, remove the last element of clfs and ccp_alphas, because it is the trivial tree with only a single node. As alpha increases, the number of nodes and the depth of the tree decrease
In [84]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

Recall vs alpha for training and testing sets

In [85]:
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    recall_train.append(recall_score(y_train, pred_train))

recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    recall_test.append(recall_score(y_test, pred_test))
In [86]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
In [87]:
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0006414326414326415, random_state=1)
  • Post-pruning using ccp_alpha returns the same model as the initial, unpruned tree
  • Since the post-pruned model is the same as the initial decision tree model, its performance and feature importances are also the same

Post-Pruning

In [88]:
estimator_2 = DecisionTreeClassifier(
    ccp_alpha=0.0006414326414326415, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
estimator_2.fit(X_train, y_train)
Out[88]:
DecisionTreeClassifier(ccp_alpha=0.0006414326414326415,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)
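The `class_weight={0: 0.15, 1: 0.85}` above up-weights the rare positive class, so splits that recover loan buyers reduce impurity more. The chosen weights are a manual approximation of the inverse class frequencies (~90.5% vs. ~9.5%); sklearn can derive such weights automatically, as this sketch on toy labels shows:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# toy labels mirroring the ~90.5% / ~9.5% class split seen in the train data
y_demo = np.array([0] * 905 + [1] * 95)

# "balanced" weights each class by n_samples / (n_classes * count_of_class)
w = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)
print(w)  # the minority class gets ~9.5x the weight of the majority
```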

Checking performance on training data

In [89]:
confusion_matrix_sklearn(
    estimator_2, X_train, y_train
)  # confusion matrix for training data
In [90]:
decision_tree_tune_post_train = model_performance_classification_sklearn(
    estimator_2, X_train, y_train
)
decision_tree_tune_post_train
Out[90]:
Accuracy Recall Precision F1
0 0.990857 1.0 0.911846 0.95389

Visualizing the Decision Tree

In [91]:
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator_2,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)

# add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [92]:
# Text report showing the rules of decision tree

print(tree.export_text(estimator_2, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |--- weights: [0.15, 1.70] class: 1
|   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |--- ZIPCode_91 <= 0.50
|   |   |   |   |   |   |   |--- weights: [6.15, 0.00] class: 0
|   |   |   |   |   |   |--- ZIPCode_91 >  0.50
|   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |--- Income >  81.50
|   |   |   |   |   |--- Mortgage <= 152.00
|   |   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |   |   |--- weights: [2.25, 9.35] class: 1
|   |   |   |   |   |   |--- Securities_Account >  0.50
|   |   |   |   |   |   |   |--- weights: [0.60, 0.00] class: 0
|   |   |   |   |   |--- Mortgage >  152.00
|   |   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |--- CCAvg >  3.95
|   |   |   |   |--- weights: [6.75, 0.00] class: 0
|   |   |--- CD_Account >  0.50
|   |   |   |--- weights: [0.15, 6.80] class: 1
|--- Income >  98.50
|   |--- Education_1 <= 0.50
|   |   |--- Income <= 116.50
|   |   |   |--- CCAvg <= 2.80
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [5.40, 0.00] class: 0
|   |   |   |   |--- Income >  106.50
|   |   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |   |--- Age <= 41.50
|   |   |   |   |   |   |   |--- Mortgage <= 51.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 1.55
|   |   |   |   |   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg >  1.55
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 1.75
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |   |   |   |   |   |   |--- CCAvg >  1.75
|   |   |   |   |   |   |   |   |   |   |--- CCAvg <= 2.40
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- CCAvg >  2.40
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 0.85] class: 1
|   |   |   |   |   |   |   |--- Mortgage >  51.50
|   |   |   |   |   |   |   |   |--- weights: [1.65, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  41.50
|   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.60, 3.40] class: 1
|   |   |   |   |   |--- Age >  57.50
|   |   |   |   |   |   |--- weights: [1.35, 0.00] class: 0
|   |   |   |--- CCAvg >  2.80
|   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |--- ZIPCode_93 <= 0.50
|   |   |   |   |   |   |--- weights: [0.90, 19.55] class: 1
|   |   |   |   |   |--- ZIPCode_93 >  0.50
|   |   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |   |   |--- Age >  63.50
|   |   |   |   |   |--- weights: [0.30, 0.00] class: 0
|   |   |--- Income >  116.50
|   |   |   |--- weights: [0.00, 188.70] class: 1
|   |--- Education_1 >  0.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 100.00
|   |   |   |   |--- CCAvg <= 4.20
|   |   |   |   |   |--- weights: [0.45, 0.00] class: 0
|   |   |   |   |--- CCAvg >  4.20
|   |   |   |   |   |--- weights: [0.00, 1.70] class: 1
|   |   |   |--- Income >  100.00
|   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |--- weights: [2.10, 0.00] class: 0
|   |   |   |   |   |--- Securities_Account >  0.50
|   |   |   |   |   |   |--- weights: [0.15, 0.85] class: 1
|   |   |   |   |--- Income >  103.50
|   |   |   |   |   |--- weights: [64.95, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- Income <= 108.50
|   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |--- weights: [1.05, 0.00] class: 0
|   |   |   |   |--- Family >  3.50
|   |   |   |   |   |--- weights: [0.15, 0.85] class: 1
|   |   |   |--- Income >  108.50
|   |   |   |   |--- weights: [0.45, 45.05] class: 1

In [93]:
## Gini importance - importance of a feature computed as (normalized) total reduction of the criterion brought by that feature

print(
    pd.DataFrame(
        estimator_2.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.601871
Family              0.143908
Education_1         0.126242
CCAvg               0.087741
CD_Account          0.011265
Age                 0.009331
Mortgage            0.004972
Securities_Account  0.004830
ZIPCode_91          0.002659
Education_2         0.002217
Education_3         0.001705
Online              0.001693
ZIPCode_93          0.001566
ZIPCode_92          0.000000
ZIPCode_94          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
CreditCard          0.000000
ZIPCode_90          0.000000
In [94]:
importances = estimator_2.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="maroon", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

Checking performance on test data

In [95]:
confusion_matrix_sklearn(estimator_2, X_test, y_test)  # confusion matrix on test data
In [96]:
# model performance on test data
decision_tree_tune_post_test = model_performance_classification_sklearn(
    estimator_2, X_test, y_test
)
decision_tree_tune_post_test
Out[96]:
Accuracy Recall Precision F1
0 0.979333 0.912752 0.883117 0.89769

Model Comparison and Final Model Selection¶

In [97]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_tune_post_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[97]:
Decision Tree sklearn Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 1.0 0.990286 0.990857
Recall 1.0 0.927492 1.000000
Precision 1.0 0.968454 0.911846
F1 1.0 0.947531 0.953890
In [98]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_tune_post_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
Out[98]:
Decision Tree sklearn Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.979333 0.980000 0.979333
Recall 0.892617 0.865772 0.912752
Precision 0.898649 0.928058 0.883117
F1 0.895623 0.895833 0.897690
  • The decision tree with post pruning is giving the highest recall on the test set

Actionable Insights and Business Recommendations¶

  • The key customer attributes driving loan purchase are Income, Family size, education level (Education_1, i.e. undergraduate-only education) and CCAvg
  • Income is the main variable to consider when targeting customers for personal loans; the tree's top split suggests targeting customers with annual incomes above roughly 116 thousand dollars
  • Prioritize customers with a family size of three or more, as they are more likely to purchase personal loans
  • A customer's ZIP Code, whether they transact online, and whether they hold a credit card from another bank are not major determinants of loan purchase